Rethinking resampling in the particle filter on 
graphics processing units 
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Abstract — Modern parallel computing devices such as the 
graphics processing unit (GPU) have gained significant traction 
in scientific computing, and are particularly well-suited to data- 
parallel algorithms such as the particle filter. Of the components 
of the particle filter, the resampling step is the most difficult to 
implement well on such devices, as it often requires a collective 
operation, such as a sum, across weights. We present and compare 
a number of resampling algorithms in this work, including rarely- 
used alternatives based on Metropolis and rejection sampling. 
We find that these alternative approaches perform significantly 
faster on the GPU than more common approaches such as the 
multinomial, stratified and systematic resamplers, a speedup 
attributable to the absence of collective operations. Moreover, 
in single-precision (particularly relevant on GPUs due to its 
faster performance), the common approaches are numerically 
unstable for plausibly large numbers of particles, while these 
alternative approaches are not. Finally, we provide a number of 
auxiliary functions of practical use in resampling, such as for the 
permutation of ancestry vectors to enable in-place propagation 
of particles. 

Index Terms — Particle filter, sequential Monte Carlo, state- 
space models, resampling, graphics processing unit 



I. Introduction 

For some sequence of time points t = 1, . . . , T and obser- 
vations at those times y 1; . . . , yy, the particle filter 0], 
uses N weighted samples, or particles, to recursively estimate 
the time marginals p(x t |yi :t ) of the latent states xi, . . . ,xy 
of the state-space model 

T 

P(x :T,yi:T) = P( X o) II PiVt l X * M X t l X *-l ) > C 1 ) 

t=l 

depicted in Figure [T] 

Pseudocode for the simplest bootstrap particle filter HI is 
given in Code [T] The initialisation, propagation and weighting 
steps are readily parallelised, being independent operations on 
each particle x. l t and its weight w\. Resampling, on the other 
hand, is a collective operation across particles and weights, so 
that parallelisation is more difficult. It is here that the present 
work concentrates. 

The resampling step can be encoded by a randomised 
algorithm ANCESTORS that accepts a vector, w t _i G R N , of 
particle weights and returns a vector, a f , of integers between 
1 and N, where each a\ is the index of the particle at time 
t — 1 which is to be the ancestor of the ith particle at time 
t. Alternatively, the resampling step may be encoded by a 
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Fig. 1. The state-space model. 



Code 1 Pseudocode for the bootstrap particle filter. 

PARTICLE-FlLTER(iV G N+,T G N ) 

1 foreachi G {1, . . . ,N} 

2 
3 
4 
5 
6 

7 



Xg ~ p(xo) // initialise particle i 
w' l l/N II initialise weight i 
fort = 1, . . . ,T 

& t 4- ANCESTORS(w t _i) // resample 
foreachi g {1, . . . ,N} 

x| ~ p(xt\x-t-i) H propagate particle i 
w\ 4— p{yt\x\) II weight particle i 



randomized algorithm OFFSPRING that also accepts a vector, 
w t _i G M. N , of particle weights, but instead returns a vector, 
o t , of integers between between and N, where each o\ is 
the number of offspring to be created from the ith particle at 
time t — 1 for propagation to time t. As we shall see, each 
resampling algorithm more naturally takes one form or the 
other, and ancestry vectors are readily converted to offspring 
vectors and vice-versa (jpII]provides functions to achieve this). 

Following the resampling step is the propagation step. In 
the implementation of this there are two options in arranging 
memory. The first is to have an input buffer holding the 
particles at time t — 1, and a separate output buffer into which 
to write the propagated particles at time t. The second is to 
work in-place, reading and writing particles from and to the 
same buffer. The second option is more memory efficient by 
a factor of two (for a fixed number of particles), but places 
more stringent conditions on the output of the resampling 
algorithm. It is sufficient that the ancestry vector, a t , satisfies 
ViG {!,... ,N}: 



ol>0 



(2) 



2 



With this, it is possible to insert a copy step immediately Code 2 Pseudocode for various primitive functions. 
before each propagation step, setting x l t _ 1 <— for all 



i = 1,...,N, but a\ ^ i. Each particle can then be 
propagated in-place by reading from and writing to the same 
buffer. Importantly, Q ensures that the copies can be done 
concurrently without read and write conflicts, as each particle 
is either read from or written to, but not both. The focus of 
this work is on this in-place mode, such that algorithms are 
timed up to the delivery of an ancestry vector that satisfies 
Q. We also present auxiliary functions for the permutation of 
arbitrary ancestry vectors to achieve this. 

The challenge in programming a parallel device such as 
a graphics processing unit (GPU) is that, to occupy its 
full capacity, an algorithm must be executable with tens of 
thousands of threads. Furthermore, the threads with which it 
executes are not entirely independent, grouped into teams (of 
size 32 on current hardware [3|) called warps. The threads 
of a warp should ideally follow the same path through an 
algorithm. Where they diverge to different branches of a 
conditional, or different trip counts of a loop, their execution 
becomes serialised and parallelism is lost (referred to as warp 
divergence). A further consideration is that, for all the threads 
of a warp, memory reads and writes should be from or to 
neighbouring addresses for best performance (referred to as 
the coalescing of reads and writes). Considerations such as 
these shape the development of the algorithms in this work. 
The reader is referred to the OpenCL standard [4] for general 
information, and the Compute Unified Device Architecture 
(CUDA) for more detail on optimising for NVIDIA GPUs 
in particular. 

Parallel implementation of resampling schemes has been 
considered before, largely in the context of distributed memory 
machines J6), 0. Likewise, implementation of the particle 
filter on GPUs has been considered 0, 0, (TO), although 
with an emphasis on low-level implementation issues rather 
than algorithmic adaptation as here. 

Section [II] presents algorithms for resampling based on 
multinomial, stratified, systematic, Metropolis and rejection 



sampling. Section III presents auxiliary functions for convert- 



ing between offspring and ancestry vectors, and the permuta- 
tion of ancestry vectors to satisfy Q. Empirical results for 
execution time and variance in outcomes are obtained for 
all algorithms in Section |IV] with concluding discussion in 
Section [Vj 

II. Resampling algorithms 

We present parallel resampling algorithms suitable for both 
multi-core central processng units (CPUs) and many-core 
graphics processing units (GPUs). Where competitive serial 
algorithms exist, these are also presented. Resampling algo- 
rithms are nicely visualised by arranging particles by weight 
in a circle, as in Figure [2] which may help to elucidate the 
various approaches. 

The algorithms presented here are described using pseu- 
docode with a number of conventions. In particular, we make 
use of primitive operations such as searches, transformations, 
reductions and prefix sums. Such operations will be familiar 



Inclusive-Prefix-Sum(w e R N ) -> R N 

1 W i «- £5=i w j 

2 return W 

Exclusive-Prefix-Sum(w g R n ) -> R N 



1 W l 







% = 1 



£-=V i>l 



2 return W 



Adjacent-Difference(W g R n ) -> R N 

1 W l 4- < 

\W l -W l ~ l i>l 

2 return w 



Sum(w g 



1 return J2?=i w% 

Lower-Bound(W g R n ,u g M) N+ 

1 requires 

2 W is sorted in ascending order 

3 return 

4 the lowest j such that u may be inserted into 
position j of W and maintain its sorting. 



to programmers of, for example, the C++ standard template 
library (STL) or Thrust library ifTTI . and their implementation 
on GPUs has been well-studied lfl2l . The advantage of describ- 
ing algorithms in this way is that the efficient implementation 
of these primitives in both serial and parallel contexts is well 
understood, and a single pseudocode description in terms of 
primitives will often suffice for both. Code [2] defines the 
particular primitives used throughout this work. 

We distinguish between the foreach and for constructs. The 
former is used where the body of the loop is to be executed for 
each element of a set, with the order unimportant. The latter 
is used where the body of the loop is to be executed for each 
element of a sequence, where the order must be preserved. The 
intended implication is that foreach loops may be parallised, 
while for loops may not be. 

In what follows, we omit the subscript t from weight, off- 
spring and ancestry vectors, as all of the algorithms presented 
behave identically at each time in the particle filter. 

A. Multinomial resampling 

Multinomial resampling proceeds by drawing each a 1 in- 
dependently from the categorical distribution over C = 
{1,...,N}, where P{a i = j) = w' J /Sum(w). Pseudocode 
is given in Code [3] The algorithm is dominated by the N 
calls of Lower-Bound, which if implemented with a binary 
search, will give 0(N \og 2 N) complexity overall. 

The Inclusive-Prefix-Sum operation on line [T] of Code 
[3] is not numerically stable, as large values may be added 
to relatively insignificant ones during the procedure (an issue 



(a) Multinomial 



(b) Stratified 



(c) Systematic 



(d) Metropolis 



(e) Rejection 



Fig. 2. Visualisation of the resampling algorithms considered. Arcs along the perimeter of the circles represent particles by weight, arrows indicate selected 
particles and are positioned (a) uniformly randomly in the multinomial resampler, (b & c) by evenly slicing the circle into strata and randomly selecting an 
offset (stratified resampler) or using the same offset (systematic resampler) into each stratum, (d) by initialising multiple Markov chains and simulating to 
convergence in the Metropolis resampler, or (e) by rejection sampling. 



Code 3 Pseudocode for parallel multinomial resampling. 

Multinomial- Ancestors (w e R N ) R N 

1 W <- Inclusive-prefix-sum(w) 

2 foreachi e {1,...,N} 

3 v? ~U[Q,W N ) 

4 a { <- LOWER-BOUND(W,M i ) 

5 return a 



intrinsic to any large summation). With large N, assigning 
the weights to the leaves of a binary tree and summing with a 
depth-first recursion over this will help. With large variance in 
weights, pre-sorting will also help. While log-weights are often 
used in the implementation of particle filters, these need to be 
exponentiated (perhaps after rescaling) for the INCLUSIVE- 
Prefix-Sum operation, so this does not alleviate the issue. 

Serially, the same approach may be used, although a single- 
pass approach of complexity O(N) is enabled by generating 
sorted uniform random variates lfT3l . Code [4] details this 
approach. There is scope for a small degree of parallelism here 
by dividing N amongst a handful of threads, but not enough 
to make this a useful algorithm on the GPU. A drawback is 
the use of relatively expensive logarithm functions, but we 
nevertheless find it superior to Code [3] on CPU. 

B. Stratified resampling 

The variance in outcomes produced by the multinomial 
resampler may be reduced [14] by stratifying the cumula- 
tive probability function of the same categorical distribu- 
tion, and randomly drawing one particle from each stratum. 
This stratified resampler |[T5l most naturally delivers not the 
ancestry or offspring vector, but the cumulative offspring 
vector, defined with respect to the offspring vector o as 
O = Inclusive-Prefix-Sum(o). Pseudocode is given in 
Code [5] The algorithm is of complexity 0(N). 

As for multinomial resampling, the Inclusive-Prefix- 
Sum operation on line [2] of Code [5] is not numerically stable. 
The same strategies to ameliorate the problem apply. 

Line [6] of Code [5] is more problematic. Consider that there 
may be a j such that, for i > j, u k is not significant 



Code 4 Pseudocode for serial, single-pass multinomial resam- 
pling. 



Multinomial- Ancestors (w e R N ) - 
1 W <- Exclusive-prefix-sum(w) 



pJV 



2 W <- W 



N 



W 



II sum of weights 



3 
4 
5 
6 
7 



10 
11 

12 



InMax <- 

fori = N,...,1 
u ~ W[0,1) 

InMax 4— InMax + ln(u)/i 
u <— W cxp (InMax) 
while u < Wi 

J <- j - 1 
a 1 <- j 
return a 



Code 5 Pseudocode for stratified resampling. 



Stratified-Cumulative-Offspring(w e Mr) 



pJV 



u e R N ~ i.i.d.W[0,l) 

W <- Inclusive-Prefix-Sum(w) 

foreachi e {!,..., N} 



W N 

k l 4- min (N, [r l \ 



1 



return O 



(N, 



against r % under the floating-point model, so that the result 
of r % + u k is just r % . For such i, no random sample is 
being made within the strata. Furthermore, rounding up on 
the same line might easily deliver N — N + 1, not 
N = N as required, if not for the quick-fix use of min! 
This is particularly relevant in the present context because 
contemporary GPUs have significantly faster single-precision 
than double-precision floating-point performance, so that the 
use of single precision is always tempting. In single precision, 
seven significant figures (in decimal) can typically be expected. 



4 



Code 6 Pseudocode for systematic resampling. 



Code 7 Pseudocode for Metropolis resampling. 



Systematic-Cumulative-Offspring(w e 



t,N\ 



1 

2 
3 
4 
5 
6 



u~W[0,l) 

W <- Inclusive-Prefix-Sum(w) 
foreachz e {1, . . . ,iV} 

O l <- 
return O 



min 



(iV, [r l +u\) 



Consider that, with N around one million, almost certainly no 
u k <E [0, 1) is significant against r l at high i. One million 
particles is not an unrealistic number for some contemporary 
applications of the particle filter. In double-precision, however, 
where 15 significant figures (in decimal) is expected, it is 
unlikely that N is sufficiently large for this to be a problem 
in current applications. Note that while pre-sorting weights 
and summing over a binary tree can help with the numerical 
stability of the INCLUSIVE-PREFIX-SUM operation, it does 
not help with this latter issue. 



C. Systematic resampling 

The variance in outcomes of the stratified resampler may 
often, but not always lfT4l . be further reduced by using the 
same random offset when sampling from each stratum. This 
is the systematic resampler (equivalent to the deterministic 
method described in the appendix of 1031 ). Pseudocode is 
given in Code|6] which is a simple modification to Code[5] The 
same complexity and numerical caveats apply to the systematic 
resampler as for the stratified resampler. 

D. Metropolis resampling 

The preceding resamplers all suffer from two problems: 

1) they exhibit numerical instability for large N or large 
weight variance, and 

2) they require a collective operation over the weights, 
specifically Inclusive-Prefix-Sum, which is less 
readily parallelised than independent operations on 
weights. 

We present two alternative approaches, not typically consid- 
ered in the literature, which do not suffer from these problems. 
They are more numerically stable, more readily parallelised, 
and consequently better suited to the breadth of parallelism 
offered by GPU hardware. 

The first produces a multinomial resample via the Metropo- 
lis algorithm [16] rather than by direct sampling. This was 
briefly studied by the first author in ifTTl . but a more complete 
treatment is given here. Instead of the collective operation, 
only the ratio between any two weights is ever computed, 
and only when needed. Code [7] describes the approach. The 
complexity of the algorithm is discussed later. 

The Metropolis resampler is parameterised by B, the num- 
ber of iterations to be performed before convergence is as- 
sumed and each particle may settle on its chosen ancestor. 



Metropolis-Ancestors(w e 
1 foreachi e {1, ...,N} 

k <- i 

for n = 1, . . . , B 

u~«[0,l] 
j~U{l,...,N} 

k 



pJV 



B e N) 



a' <- 
return a 



if u < w J /w 

k -s— i 

- k 



As B must be finite, the algorithm produces a biased sample. 
Selecting B is a tradeoff between speed and bias, with smaller 
B giving faster execution time but larger bias. This bias 
may not be much of a problem for filtering applications, but 
does violate the assumptions that lead to unbiased marginal 
likelihood estimates in a particle Markov chain Monte Carlo 
(PMCMC) framework flJD, so care should be taken. 

We provide guidance as to the selection of B by bounding 
the bias e for the maximum normalised weight p* [17|. One 
may have both an upper bound on unnormalised weights, 
supiu, and an expected weight, E(W), from which this might 
be computed as p* = sup w/(NE(w)) (see { IV for one 



such example, albeit contrived). Otherwise, this might be 
thought of as a tolerance threshold on weight degeneracy and 
a value specified accordingly. Convergence of a Metropolis 
chain depends on a sufficient number of steps being taken 
such that the probability of a particle of this maximum weight 
being selected is within e of p*. 

Following [ 19|[§2.1], for N particles, construct a binary 0-1 
process where state 1 represents being at a particle with the 
maximum weight, and state represents being at any other 
particle. The transition matrix is 



T = 



1 — a a 
P 1-/3 J ' 



where 



1 l-p* 

OL — : 



N p* 

is the probability of moving from state 1 to 0, and 

1 







N 



(3) 



(4) 



(5) 



is the probability of moving from state to 1, with a uniform 
proposal over all N particles and the Metropolis acceptance 
rule. The i?-step transition matrix is then: 



rpti 



\ 



a (3 



X 1 



+ (3 V a P J a + P \ -P P 



(6) 



where A = (1 — a — P). We require that the second term satisfy 
the bias bound: 



a —a 



X 1 



a + p\-p P 



< e, 



(7) 



5 



Code 8 Pseudocode for rejection resampling. 



Rejection-Ancestors(w g 

1 foreachi G {1, ...,N} 

2 j <- i 

3 P~U[0,1] 

4 while j3 > sup w 

5 j~U{l,...,N} 

6 /3~W[0,1] 

7 a 1 -s- j 

8 w l <- 1 

9 return a 



»1V\ 



so 



a 







max(a, /3) < e, 



which is satisfied when 



B > lo g> 



e(a + /3) 
max(a, /3) 



(8) 



(9) 



Thus, by specifying a p* and some small e, one proceeeds 
by calculating a and /3, then A, then finally B as a suggested 
number of steps for the Metropolis resampler. An appropriate 
value for e might be relative to p*, such as p* x 10~ 2 . 



The experiments in I IV provide some empirical evidence to 



support this as a reasonable choice for e, and ultimately B. 

The complexity of the Metropolis resampler is O(NB), but 
B may itself be a function of N and weight variance, as in 
the analysis above. 



as a proposal distribution using REJECTION-ANCESTORS, 
then importance weight each particle i with w 1 <— w a /v a . 
Note that each weight is 1 except where w a > sup v. The 
procedure may also be used when no hard upper bound on 
weights exists (sup w), but where some reasonable substitute 
can be made (supi>). 

An issue unique to the rejection resampler is that the compu- 
tational effort required to draw each ancestor varies, depending 
on the number of rejected proposals before acceptance. This is 
an example of a variable task-length problem [21], particularly 
acute in the GPU context where it causes warp divergence, 
with threads of the same warp tripping the loop on line [4] 
of Code [8] different numbers of times. A persistent threads 
strategy 11221 . ED might be used to mitigate the effects of 
this, although we have not been successful in finding such an 
implementation that does not lose more than it gains through 
additional overhead in register use and branching. 

F. Other algorithms 

The resampling algorithms presented here do not constitute 
an exhaustive list of those in use, but are reasonably represen- 
tative, and tend to form the building blocks of more elaborate 
schemes. A notable example is residual resampling |23|, 
which deterministically draws each particle \_Nw l /Sum(w)J 
number of times before making up the deficit in the number 
of particles by randomly drawing additional particles with 
probabilities proportional to the residuals N 'w l / 'Sum(w) — 
LA^u//Sum(w)J . The first stage may be implemented simi- 
larly to the systematic resampler, and the second using either 
a multinomial, Metropolis or rejection approach. 



E. Rejection resampling 

When an upper bound on the weights is known a priori, 
rejection sampling is possible. Compared to the Metropo- 
lis resampler, the rejection resampler also avoids collective 
operations and their numerical instability, but offers several 
additional advantages: 

1) it is unbiased, 

2) it is simpler to configure, and 

3) it permits a first deterministic proposal (see, e.g., GUI ) 
that a 1 = i, improving the variance in outcomes. 

Pseudocode is given in Code [8] If the third item above is 
removed, rejection resampling is an alternative implementation 
of multinomial resampling. The complexity of the algorithm 
is C(SuM(w)/supw). 

The rejection resampler will perform poorly if supw is 
not a tight upper bound, i.e. if maxfw 1 , . . . , w N ) <C supra. 
While one can empirically set supw = max(w 1 , . . . , w N ), 
this would require a collective operation over weights that 
would defeat the purpose of the approach. 

Performance can be tuned if willing to concede a weighted 
outcome from the resampling step, rather than the usual 
unweighted outcome. To do this, choose some sup v < sup w, 
then form a categorical importance proposal using the weights 
v 1 ,. . . ,v N , where v % = min(w\ supu). Clearly supw forms 
an upper bound on these new weights. Sample from this 



III. Auxiliary functions 

The multinomial, Metropolis and rejection resamplers most 
naturally return the ancestry vector a, while the stratified and 
systematic resamplers return the cumulative offspring vector 
O. Conversion between these is reasonably straightforward. 

An offspring vector o may be converted to a cumulative off- 
spring vector O via the Inclusive-Prefix-Sum primitive, 
and back again via Adjacent-Difference. A cumulative 
offspring vector may be converted to an ancestry vector via 
9\ and an ancestry vector to an offspring vector via 
10] These functions perform well on both CPU and 
GPU. An alternative approach to Cumulative-Offspring- 
To-Ancestors, using a binary search for each ancestor, was 
found to be slower. 

An ancestry vector may be permuted to satisfy OJ. A serial 
algorithm to achieve this is straightforward and given in Code 
11 This O(N) algorithm makes a single pass through the 
ancestry vector with pair-wise swaps to satisfy the condition. 

The simple algorithm is complicated in a parallel context as 
the pair-wise swaps are not readily serialised without heavy- 
weight mutual exclusion. In parallel we propose Code 12 
This algorithm does not perform the permutation in-place, but 
instead produces a new vector c G M that is the permutation 
of the input vector a. It introduces a new vector d G 'R N , 



through which, ultimately, c l 



In the first stage of the 



algorithm, PREPERMUTE, the thread for element i attempts 



6 



Code 9 Pseudocode conversion of an offspring vector o to an 
ancestry vector a. 

Cumulative-Offspring-to-Ancestors(0 e R N ) -> 

R N 

1 foreachi 6 {1,...,N} 

2 if i = l 

3 siart <— 

4 else 

5 start <- O' -1 

6 o ,; <- - start 

7 for j = 1, . . . , o % 

8 a start+j ^_ ^ 

9 return a 



Code 10 Pseudocode conversion of an ancestor vector a to an 
offspring vector o. 



ANCESTORS-TO-OFFSPRING(a G R N ) -> R 

1 o <- 

2 foreachi G {1, ...,N} 

3 atomic o l 4- o 4 + 1 

4 return o 



Code 12 Parallel algorithm for the permutation of an ancestry 
vector to ensure pi. 

PREPERMUTE(a G R^) M W 

// claim places to satisfy <|2p 

1 Let d G and set d l <-N + 1 for i = 1, . . . , N. 

2 foreach i G {1, . . . , N} 

3 atomic d a ' 4- min(d°' , i) 

4 ensures 

5 Vi(i G {1, . . . , iV} : o l > d ,; = minj(a j = i)) 

6 return d 

PERMUTE(a G R N ) -> M W 



1 d 4- PREPERMUTE(a) 

2 foreach i G {1, . . . , N} 

3 x 4- d a * 

4 if x 7^ i 

5 // claim unsuccessful in Prepermute 

6 x <— i 

7 while d x < N 

8 x <- d x 

9 d x <-i 

10 foreach i G {1, . . . , iV} 

11 c J 4- a d ' 



to claim position a* in the output vector by setting d a 4- i. 
By virtue of the min function on line [3] the element of lowest 
index always succeeds in this claim while all others contesting 
the same place fail, and the outcome of the whole permutation 
procedure is deterministic. This is desirable so that the results 
of a particle filter are reproducible for the same pseudorandom 
number seed. For each element i that is not successful in 
its claim, the thread for i instead attempts to claim d\ if 
unsuccessful again then d d ' , then recursively d dd , . . . etc, 
until an unclaimed place is found. Note that this procedure 
is guaranteed to terminate (a proof is given in Appendix [A}. 

Two other procedures are worth mentioning, as they are 
often used in conjunction with resampling. The first, the sort- 
ing of weights for improved numerical stability, has already 
been mentioned. The second is the computation of effective 
sample size (ESS) [24 1, often used to decide whether or not 



Code 11 Serial algorithm for permuting an ancestry to ensure 

a 

PERMUTE(a G R N ) 

1 fori = 1,...,N 

2 if a 1 ^ i and a a ' ^ a 1 

3 swap(a\a a ') 

4 i 4— i — 1 // repeat for new value 

5 ensures 

6 Vi(i G {1, . . . , N} : o l > =*> a 1 = i) 



12 ensures 

13 Vi(z G {1, . . . , N} : o i > c* = i) 

14 return c 



to resample at any given time in the particle filter. The ESS 
is given by Sum(w) 2 /w t w. Note that both sorting and ESS 
are necessarily collective operations. We include these two 
procedures in our timing results to lend additional perspective. 

IV. Experiments 

Weight sets are simulated to assess the speed and accuracy 
of each resampling algorithm. For each number of particles 
N = 2 4 ,2 5 , . . . ,2 20 and observation y = 0, |, 1, l|, . . . ,4, a 
weight set is generated by sampling x l ~ A/"(0, 1), for i = 
1 , . . . , N, and setting 




This is repeated to produce 500 weight sets for each config- 
uration. Note that the construction is analagous to having a 
prior distribution of x ~ A/"(0, 1) and likelihood function of 
y ~ Af(x, 1). As y increases, the relative variance in weights 
does too. 

For this set up, the maximum weight is supw = l/v2~7r, 
and the expected weight 

E(t«) = Af(y; 0,<j 2 = 2) = -^= cx P (-]y 2 ) . (11) 



7 



The variance of the weights is 



1 



Y(w) = — -= exp 

7TV 12 



--y 2 - \E(w)f 



and so their relative variance is 

-r " ■ 2 



: exp 



(12) 



(13) 



which is increasing with y. 

These are used to set B for the Metropolis resampler 



according to the analysis in ^II-D This maximum weight 
is also used for the rejection resampler. Note that, while 
presented with weights here, our actual implementation works 
with log-weights, which we assume is fairly common practice 
for observation densities from the exponential family. 

Experiments are conducted in single-precision on two 
devices. The first device is an eight-core Intel Xeon E5- 
2650 CPU, compiling with the Intel C++ Compiler version 
12.1.3, using OpenMP to parallelise over eight threads, and 
using the Mersenne Twister pseudorandom number genera- 
tor (PRNG) 1 25 1 as implemented in the Boost. Random li- 
brary |26). The second device is an NVIDIA S2050 GPU 
hosted by the same CPU, compiling with CUDA 5.0 and the 
same version of the Intel compiler, and using the XORWOW 
PRNG E2 from the CURAND library [28]. All compiler 
optimisations are applied. 

Figure [3] shows the surface contours of the mean execution 
time across N and y for each algorithm, as well as marking 
up the fastest algorithms on each device, and across both 
devices. Execution times are taken until the delivery of an 
ancestry vector satisfying Q, and so include any of the 
auxiliary functions in pll] necessary to achieve this. Note 
that, as we would expect, execution times of the multinomial, 
stratified and systematic resamplers are not sensitive to y — or 
equivalently to the variance in weights - while the Metropolis 
and rejection resamplers are. 

As the resampling step is performed numerous times 
throughout the particle filter, it is not unreasonable to com- 
pare resampling algorithms on mean execution time alone. 
Nonetheless, some variability in execution time is expected, 
especially for the rejection resampler. This is shown in Figure 
fflby taking a horizontal transect across each surface in Figure 
p] at y = 1, then plotting N versus execution time separately 
for each device. The execution time of the sorting and ESS 
procedures is added to these plots for context. 

To assess the variability in resampling draws for each 
algorithm we compute the mean-square error in the outcome 
for each weight-set, based on the offspring vector o fl5l : 



1 N 
N ^ 



o' 
N 



w 



Sum(w) 



(14) 



We then take the mean of this across all 500 weight sets, 
and the square-root of this, to arrive at root-mean-square error 
(RMSE) as in Figure [5] This figure plots the RMSE for both 
y — 1 and y = 3. Nothing in this figure is surprising: it is 
well known that the stratified resampler reduces variance over 
the multinomial resampler, and that the systematic resampler 
can, but does not necessarily, reduce it further fl4l . Notably 
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Fig. 4. Execution times against log 2 N at y = 1 for (left) the CPU and 
(right) the GPU. All resampling algorithms are shown, along with the SORT 
and ESS procedures for context. 
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Fig. 5. Root-mean-square error (RMSE) of the various resampling algorithms 
at (left) y = 1 and (right) y = 3. 



the RMSE of the Metropolis resampler appears to match that 
of the multinomial resampler, suggesting that the procedure in 



{ II-D for setting the number of steps, B, is appropriate. Also 
of interest is that as y increases, the probability of the initial 
proposal of the rejection resampler being accepted declines, 
and so its RMSE degrades toward that of the multinomial and 
Metropolis resamplers. 

V. Discussion 



On raw speed, the Metropolis and rejection resamplers are 
worth considering for faster execution times on GPU. This is 
largely due to their avoidance of collective operations across 
all weights, which better suits the GPU architecture. They do 
not scale as well in either N or y, however, so that the stratified 
and systematic resamplers are faster at larger values of these. 
The Metropolis resampler scales worst of all in this regard. If 
there exists a good upper bound on the weights, the rejection 
resampler is faster, easier to configure and unbiased with 
respect to the Metropolis resampler. The Metropolis resampler 
is more configurable, however, and allows the tuning of B 
to trade off between execution time and bias. This may be 
particularly advantageous for applications with hard execution 
time constraints, such as real-time object tracking. Recall that 
the rejection resampler is somewhat configurable by using an 
approximate maximum weight, if one is willing to accept a 
still-weighted output from the resampling step. 
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Fig. 3. Mean execution times for algorithms on each device, (top row) CPU and (bottom row) GPU. Each surface shows the base-10 logarithm of mean 
execution time across log 2 N (where N is the number of particles) and observation y, with prior distribution over particles x ~ JV(0, 1) and likelihood 
function y ~ J\f(x, 1). Dashed lines demarcate the region, if any, where an algorithm is the fastest in its row. Solid lines demarcate where each algorithm is 
the fastest overall. 



A further consideration is that the execution time of both 
the Metropolis and rejection resamplers depends on the PRNG 
used. This dependence is by a constant factor, but can be 
substantial. Here, robust PRNGs for Monte Carlo work have 
been used, but conceivably cheaper, if less robust, PRNGs 
might be considered as another trade-off between execution 
time and bias. 

One should still be aware that the stratified resampler is 
variance-reducing with respect to the multinomial resampler, 
and that the systematic resampler can give further reduc- 
tions fl4l . This may be a more important consideration 
than raw speed, depending on the application. The rejection 
resampler, with its first deterministic proposal, has the nice 
property that it can also improve upon the variance of the 
multinomial resampler. The Metropolis resampler is expected 
to match the multinomial resampler on variance, and indeed 
any variation can only be attributed to the bias introduced by 
finite B. 

A special consideration on GPUs is the use of single- 
precision floating-point operations to improve execution time. 
For a number of particles upwards of hundreds of thousands, 
plausible in modern applications, it is worth emphasising again 
that great care should be taken in the use of the stratified 
and systematic resamplers due to numerical instability. The 
Metropolis and rejection resamplers have better numerical 
properties, as they compute only ratios of weights. In dou- 
ble precision, the number of particles required to have the 
same issue far surpasses that which is realistic at present, so 
numerical stability is unlikely to be an issue. 
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Appendix 



We offer a proof of the termination of Code 12 
First note that PREPERMUTE leaves d in a state where, 
excluding all values of 7Y+1, the remaining values are unique. 
Furthermore, in PERMUTE the conditional on line|4]means that 
the loop on line [7] is only entered for values of i that are not 
represented in d. 

For each such i, the while loop traverses the sequence xq = 
i, x n = d x "- 1 , until d Xn = N + 1. For the procedure to 
terminate this sequence must be finite. Because each x n is an 
element of the finite set {1, ... , N}, to show that the sequence 
is finite it is sufficient to show that it never revisits the same 
value twice. The proof is by induction. 

1) As no value of d is i, the sequence cannot revisit its 
initial value xo = i. The element xq is therefore unique. 

2) For k > 1, assume that the elements of xoifc-i are 
unique. 

3) Now, xo-.k is not unique if there exists some j E 
{l,...,k- 1} such that x k = d Xk - 1 = x 3 = cf^-i, 
with Xj-i 7^ Xk-i by the uniqueness of xq-^-i- But this 
contradicts the uniqueness of the (non N + 1) values of 
d. Thus the elements of XQ : k are unique, the sequence 
is finite, and the program must terminate. □ 



